Joins


Now that the tables are cleaned and the synonyms are taken care of, we can join the three tables into one big table. One consideration when joining the tables, was ensuring that songs with the same name (but are actually different songs as indicated by the artist) were not combined. We used the song_title and the artist columns as parameters to prevent that mishap from occuring.

Examples of Song Titles

Here is a brief example of a song with the same name but different artists.

Show the code
spotify_cleaned |>
  filter(song_title == "Cool for the Summer")
Show the code
tiktok_cleaned |>
  filter(song_title == "Cool for the Summer")
Spotify
SELECT *
FROM spotify_cleaned
WHERE song_title = 'Cool for the Summer'
1 records
song_title artist rank_s tempo
Cool for the Summer Demi Lovato 61 114.06
TikTok
SELECT *
FROM tiktok_cleaned
WHERE song_title = 'Cool for the Summer'
1 records
song_title artist rank_t tempo
Cool for the Summer Alex D’Rosso 49 136.01
Spotify
spotify_cleaned.query("song_title == 'Cool for the Summer'")
              song_title       artist  rank_s   tempo
360  Cool for the Summer  Demi Lovato      61  114.06
TikTok
tiktok_cleaned.query("song_title == 'Cool for the Summer'")
              song_title        artist  rank_t   tempo
182  Cool for the Summer  Alex D'Rosso      49  136.01

Joins/Merges

We used a full_join and joined by song_title and artist. A full_join meant that all observations from both tables would be kept. We wanted to keep all observations even if it meant that some songs would have NA values for the rankings on the other charts.

# Join the TikTok and Spotify data by song title and call the new table s_and_t.
s_and_t <- spotify_cleaned |>
  full_join(tiktok_cleaned, by = join_by(song_title, artist))
s_and_t
# Join the s_and_t table with the billboard table and call the new dataframe combined.
combined <- s_and_t |>
  full_join(billboard_cleaned, by = join_by(song_title, artist))
combined

Because both the Spotify and TikTok tables had a column called tempo, the resulting table had two columns for tempo: tempo.x and tempo.y. We used coalesce to combine the tempo columns into one since the tempo for an individual song would be the same no matter which chart it appears on. coalesce takes the tempo from the first non-missing value in the tempo.x and tempo.y columns and puts value in the new tempo column.

combined_cleaned <- combined |>
  mutate(tempo = coalesce(tempo.x, tempo.y), .keep = "unused")
combined_cleaned

We tried two different methods for joining the tables in SQL. This first method pipes the code blocks together. We used a FULL OUTER JOIN to join by song_title and artist. The FULL OUTER JOIN combines all records from both tables. It joins the data where there are matches and fills in null for the missing data. COALESCE is used here to combine the tempo, song titles, and artists.

-- combine the spotify and tiktok tables
CREATE OR REPLACE TABLE s_and_t AS 
    SELECT *, 
           COALESCE(tiktok_cleaned.tempo, spotify_cleaned.tempo) AS tempo_st, 
           COALESCE(tiktok_cleaned.song_title, spotify_cleaned.song_title) AS song_title_st, 
           COALESCE(tiktok_cleaned.artist, spotify_cleaned.artist) AS artist_st
    FROM spotify_cleaned
    FULL OUTER JOIN tiktok_cleaned ON tiktok_cleaned.song_title = spotify_cleaned.song_title AND tiktok_cleaned.artist = spotify_cleaned.artist
;

-- combine the spotify and tiktok table with the billboard table
CREATE OR REPLACE TEMP TABLE s_t_b AS
    SELECT *, 
           COALESCE(s_and_t.song_title_st, billboard_cleaned.song_title) AS song_title_real, 
           COALESCE(s_and_t.artist_st, billboard_cleaned.artist) AS artist_real
    FROM s_and_t
    FULL OUTER JOIN billboard_cleaned ON billboard_cleaned.song_title = s_and_t.song_title_st AND billboard_cleaned.artist = s_and_t.artist_st
;
-- select only the relevant columns
CREATE OR REPLACE TABLE combined_cleaned_piped AS    
  SELECT song_title_real AS song_title, 
         artist_real AS artist, 
         rank_s, 
         rank_t, 
         rank_b, 
         tempo_st AS tempo
  FROM s_t_b

The second method uses WITH. The code creates several temporary tables that are then queried. The result of the query is stored as a new table.

--combine the spotify and tiktok tables
CREATE OR REPLACE TABLE combined_cleaned AS
  WITH s_t AS (
      SELECT *, 
             COALESCE(tiktok_cleaned.tempo, spotify_cleaned.tempo) AS tempo_st, 
             COALESCE(tiktok_cleaned.song_title, spotify_cleaned.song_title) AS song_title_st, 
             COALESCE(tiktok_cleaned.artist, spotify_cleaned.artist) AS artist_st
      FROM spotify_cleaned
      FULL OUTER JOIN tiktok_cleaned ON tiktok_cleaned.song_title = spotify_cleaned.song_title AND tiktok_cleaned.artist = spotify_cleaned.artist
  ), s_t_b AS (
--combine the spotify and tiktok table with the billboard table
     SELECT *, 
             COALESCE(s_and_t.song_title_st, billboard_cleaned.song_title) AS song_title_real, 
             COALESCE(s_and_t.artist_st, billboard_cleaned.artist) AS artist_real
      FROM s_and_t
      FULL OUTER JOIN billboard_cleaned ON billboard_cleaned.song_title = s_and_t.song_title_st AND billboard_cleaned.artist = s_and_t.artist_st
  )

SELECT song_title_real AS song_title, 
         artist_real AS artist, 
         rank_s, 
         rank_t, 
         rank_b, 
         tempo_st AS tempo
FROM s_t_b

The Python version of this operation uses an outer .merge to combine the tables so that all observations are accounted for. The code is read “inside-out,” so first the Spotify and TikTok tables are merged, then the Billboard table is merged with the result and called combined. In the process of merging, two new tempo columns (tempo_x and tempo_y) were created. We used .combine_first (very similar to COALESCE is SQL), which combines two dataframe objects by replacing a null value with a non-null value.

#combine the three datasets
combined = pd.merge(billboard_cleaned, 
                   (pd.merge(spotify_cleaned, 
                             tiktok_cleaned, 
                             how='outer', 
                             on=['song_title', 'artist'])), 
                    how='outer', 
                    on=['song_title', 'artist'])

#combine the tempo columns
combined = (combined
        .assign(tempo = lambda df_: df_.tempo_x.combine_first(combined.tempo_y))
        .drop(columns = ['tempo_x', 'tempo_y'])
        [['song_title', 'artist', 'rank_s', 'rank_t', 'rank_b', 'tempo']]
)
combined
                                      song_title  ...    tempo
0                                 'Til You Can't  ...      NaN
1                               'Till I Collapse  ...  171.447
2    1, 2, 3 (feat. Jason Derulo & De La Ghetto)  ...   94.968
3    1, 2, 3 (feat. Jason Derulo & De La Ghetto)  ...   94.968
4                     10 Things I Hate About You  ...  153.967
..                                           ...  ...      ...
924                                      traitor  ...  100.607
925            up at night (feat. justin bieber)  ...  100.012
926                                    vice city  ...   87.023
927                           you broke me first  ...  124.148
928                                        İmdat  ...  133.967

[929 rows x 6 columns]